Background: Real-world data (RWD) plays an increasingly important role in healthcare. Leveraging derived variables, such as lines of therapy (LOTs) and treatment response outcomes, enables the deployment of decision-support tools and personalized care options. However, extracting these data is a significant challenge, as they are spread across hundreds of unstructured clinical notes per patient. To maximize the accuracy and potential of RWD, dedicated healthcare professionals across multiple institutions must perform rigorous chart reviews and directed extraction of clinical data points; this non-standardized process is costly and time-consuming, creating a bottleneck in the research pipeline. To address this challenge, a scalable framework was developed: an AI agent with a doctor-in-the-loop that extracts multiple myeloma (MM) lines of therapy and outcomes from electronic health records (EHRs). This work summarizes the development of the AI agent and evaluates its performance.

Methods: An AI agent with a doctor-in-the-loop was designed to extract MM lines of therapy and outcomes from unstructured medical records. A cohort of 94 patients diagnosed with MM since 2022 was randomly selected from the Healthtree Foundation Registry; the cohort's documents were analyzed by the AI agent and reviewed by clinical experts. First, each clinical note in a patient's chart was independently evaluated with OpenAI's o4-mini model at high reasoning effort to identify documents containing information relevant to LOTs and outcomes. Relevant notes were compiled into a single document per patient, with unique identifiers retained to support source tracking. Gemini 2.5 Pro, selected for its 1-million-token context window, then processed the aggregated documents to extract all observed lines of therapy, each defined by five elements: medications; start and end dates; relevant procedures (e.g., transplant, CAR-T therapy, or apheresis); International Myeloma Working Group (IMWG) treatment response outcome; and treatment status (ongoing or completed). For each element, the model also returned the database IDs of the supporting source documents. Finally, clinicians reviewed the output data in a structured interface displaying the model's extracted values alongside the original source document texts. Each element was verified and corrected if needed, and model performance was evaluated by calculating element-level accuracy as the percentage of model-generated outputs that matched clinician-validated values.
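The abstract does not include an implementation, but the two-stage pipeline can be sketched in Python as follows. This is a minimal illustration assuming the openai and google-genai SDKs; the prompts, the ClinicalNote structure, and the output shape for the five LOT elements are hypothetical placeholders, not the authors' actual code.

# Minimal sketch of the two-stage extraction pipeline (illustrative only).
# Assumes the `openai` and `google-genai` Python SDKs; prompts, ClinicalNote,
# and the output format are hypothetical, not the authors' implementation.
import json
from dataclasses import dataclass

from openai import OpenAI
from google import genai

openai_client = OpenAI()
gemini_client = genai.Client()

@dataclass
class ClinicalNote:
    note_id: str  # unique identifier retained to support source tracking
    text: str

def is_relevant(note: ClinicalNote) -> bool:
    """Stage 1: per-note relevance triage with o4-mini at high reasoning effort."""
    resp = openai_client.chat.completions.create(
        model="o4-mini",
        reasoning_effort="high",
        messages=[{
            "role": "user",
            "content": (
                "Does this note contain information about a multiple myeloma "
                "line of therapy or treatment response? Answer YES or NO.\n\n"
                + note.text
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def extract_lots(notes: list[ClinicalNote]) -> list[dict]:
    """Stage 2: aggregate relevant notes and extract structured LOTs with Gemini 2.5 Pro."""
    aggregated = "\n\n".join(
        f"[DOC {n.note_id}]\n{n.text}" for n in notes if is_relevant(n)
    )
    prompt = (
        "From the notes below, list every line of therapy as JSON objects with "
        "fields: medications, start_date, end_date, procedures, imwg_outcome, "
        "status ('ongoing'|'completed'), and source_doc_ids for each field.\n\n"
        + aggregated
    )
    resp = gemini_client.models.generate_content(
        model="gemini-2.5-pro",
        contents=prompt,
    )
    # Clinician review and correction of this output happens downstream,
    # in the structured interface described above.
    return json.loads(resp.text)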

Results: The AI agent identified the correct number of LOTs in 89.4% of patients. Medications were correctly identified in 89.9% of patients, while procedures (e.g., transplant, CAR-T therapy) were correctly extracted in 86.6%. Treatment start and end dates were accurately assigned in 86.2% of patients. IMWG outcome classification achieved 79.8% concordance with clinician assessment. Mean clinician review time was 13.6 minutes per patient, substantially lower than the time required for manual chart extraction. Document processing, powered by large language models (LLMs), incurred an average per-patient cost of $1.20, highlighting the approach's scalability and cost-effectiveness.
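For concreteness, the element-level accuracies reported above follow the definition given in the Methods. A minimal sketch, assuming paired lists of model outputs and clinician-validated values (names hypothetical):

# Element-level accuracy: percentage of model-generated outputs that match
# clinician-validated values (per the Methods). Names are hypothetical.
def element_accuracy(model_values: list, validated_values: list) -> float:
    assert len(model_values) == len(validated_values)
    matches = sum(m == v for m, v in zip(model_values, validated_values))
    return 100.0 * matches / len(validated_values)

# Example: element_accuracy(model_medications, validated_medications)
# would yield 89.9 for the medications element reported above.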

Conclusions: This study demonstrates the feasibility of combining frontier LLMs with doctor-in-the-loop validation for extracting MM LOTs from unstructured clinical documents. The high fidelity across the treatment elements analyzed, combined with rapid clinician validation and document traceability, positions this approach as a scalable alternative to manual abstraction. The AI agent addresses long-standing bottlenecks in RWD curation by offering a reliable, low-burden method for transforming free-text data into structured treatment trajectories. This framework has potential applications in trial eligibility matching, retrospective outcomes research, and real-world clinical decision support across EHR systems.
